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ABSTRACT 

We study algorithms for matching user tracks, consisting of 
time-ordered location points, to paths in the road network. 
Previous work has focused on the scenario where the loca- 
tion data is linearly ordered and consists of fairly dense and 
regular samples. In this work, we consider the multi-track 
map matching, where the location data comes from different 
trips on the same route, each with very sparse samples. This 
captures the realistic scenario where users repeatedly travel 
on regular routes and samples are sparsely collected, either 
due to energy consumption constraints or because samples 
are only collected when the user actively uses a service. In 
the multi-track problem, the total set of combined locations 
is only partially ordered, rather than globally ordered as re- 
quired by previous map-matching algorithms. We propose 
two methods, the iterative projection scheme and the graph 
Laplacian scheme, to solve the multi-track problem by using 
a single-track map-matching subroutine. We also propose a 
boosting technique which may be applied to either approach 
to improve the accuracy of the estimated paths. In addition, 
in order to deal with variable sampling rates in single-track 
map matching, we propose a method based on a particular 
regularized cost function that can be adapted for different 
sampling rates and measurement errors. We evaluate the ef- 
fectiveness of our techniques for reconstructing tracks under 
several different configurations of sampling error and sam- 
pling rate. 

Categories and Subject Descriptors 

1.5.1 [Computing Methodologies]: Pattern Recognition, 
Statistical Models 
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1. INTRODUCTION 

Map matching is the procedure for determining the path of 
a user on a map from a sequence of location data (which we 
refer to as track). This process serves as a common prepro- 
cessing step for reasoning about traffic on the road network 
as well as for providing better location-based services [4] 
|12| |13| [T] |18| . Converting a track to a topological path on a 
map not only makes it easier to reason about paths, but also 
leads to reduced storage requirements and more efficient op- 
erations on the same. Some examples of such operations are 
comparison of tracks, which is significantly more efficient to 
do on a discrete road network than as polygonal lines on the 
plane, querying and retrieval of tracks based on similarity 
to other tracks |9j. 

Due to its importance, many methods for map matching 
have been previously proposed [4} |12|[l5}|13||18| [5], focusing 
on matching a single sufficiently dense and accurate sequence 
of locations. In this work, we consider the multi-track map 
matching problem, where we are given a number of tracks 
generated from trips through the same path, and we wish 
to recover the underlying path that generates these tracks. 
The problem is much more challenging than the single-track 
problem since each track contains a small number of samples 
(i.e. sampling intervals are large). This captures the realis- 
tic scenario where users repeatedly travel on regular routes, 
and samples are sparsely collected due to restraints in en- 
ergy consumption on the mobile device. Another scenario 
is when a service collects users' location information only 
when the user actively uses such service. In such a scenario, 
the collected location data can be very sparse. However, 
since users typically travel on the same (or similar) routes 
repeatedly, one may compensate the sparseness of the data 
by combining location data from different trips. 

The main challenge in multi-track map matching is that 
in combining data from multiple tracks, global ordering on 
all samples is not available, a necessary condition for apply- 
ing existing single-track map matching algorithms. Instead, 
each track only provides the order on a subset of locations. 
If we apply the map matching algorithm on each individual 
track, we would obtain paths with very poor quality given 
the low sampling rate on each path. 



In this work we present multi-track map matching algo- 
rithms in which partial orders from input tracks are "aggre- 
gated" to form an appropriate global order, after which we 
can reduce the problem to single-track map matching. 

Given its role in our multi-track map matching approaches, 
we also revisit single-track map matching in order to un- 
derstand how to fine-tune it for scenarios when the mea- 
surement error is high, and /or when the sampling rate of 
tracks is too high or too low. We study an approach to 
map-matching that involves minimizing a regularized cost 
function including two types of errors in measuring the qual- 
ity of a path: the data error (fidelity of the path to the data 
points), and the model error (measuring the "niceness" of 
the obtained path). The two terms are balanced accord- 
ing to a regularization parameter, and we study the optimal 
choice of regularization parameter through both theoretical 
and empirical studies. We show that, for a simplified model, 
there is an optimal choice of the regularization parameter. 
We also verify, through experiments, that the algorithm be- 
haves as predicted by our theoretical study. 

In summary, our main contributions are 

• We study the multi-track map matching problem, and 
propose two methods for solving it. Our general ap- 
proach consists of (a) merging multiple tracks into a 
single one and (b) running it through a single-track al- 
gorithm. We propose two methods for merging tracks: 
an iterative projection scheme and a graph Laplacian 
scheme. We also propose a generic framework to re- 
move outliers by aggregating the matching results from 
multiple sub-samples. 

• We revisit the single-track map matching problem, for- 
mulate an optimization problem and prove rigorously 
that the solution to the problem achieves optimal path 
reconstruction in terms of the minimax risk for a sim- 
plified model. While the optimization problem resem- 
bles previous hidden Markov model (HMM) methods, 
our approach allows a principled way to adapt its pa- 
rameter according to properties of the input data. 

• We evaluate the effectiveness of our proposed tech- 
niques for reconstructing tracks under several differ- 
ent configurations of sampling error and sampling rate. 
The evaluations are done on the dataset available in [ll] . 
The data set contains tracks collected from real users 
in Seattle, WA, using commercially available consumer 
grade GPS device. Our results indicate that the pro- 
posed approaches lead to reasonable estimates of the 
route, significantly better than what would be achieved 
in case tracks were map matched individually. 

2. RELATED WORK 

Map matching has become an increasingly important prob- 
lem over the past few years due to the proliferation of GPS 
tracking devices and track-based applications. A number of 
algorithms have been proposed to address this problem [15| 



13 1 12| |18| [5j. A class of these works contain statistical 
methods which are based on Bayesian estimators. The au- 
thors in [13| |15| use the fact that the actual positions of the 
user on a path form a Markov chain. Given the location 
measurements, a hidden Markov model is defined with the 
actual positions on the path as the hidden states. The mea- 
surement probabilities in the model are determined based 



on the location measurement noise distribution (normal dis- 
tribution with mean zero and variance a 2 ); the transition 
probabilities are determined based on the spatial geometric 
and topological restrictions along with the temporal/speed 
constraints of the trajectories. The matched path is the one 
with maximum posterior probability. 

Another line of research does not use any statistical meth- 
ods to address the problem. Authors in [5] use curve sim- 
plification for approximating the Frechet distance of curves. 
Given a polygonal curve ir and an embedded graph G with 
edges embedded as straight line segments, they attack the 
problem of finding the closest path in G to the curve n with 
respect to the Frechet distance. In [2], the average Frechet 
distance is used to reduce the effect of outliers. 

One of the contributions of our work is to introduce a 
graph Laplacian-based scheme for the multi-track map match- 
ing problem. Though graph Laplacians are widely used in 
machine learning for dimensionality reduction [3j [17] , spec- 
tral clustering [l4 16 and semi-supervised learning [2], us- 
ing them for map matching is a contribution of the present 
paper. Given the partial orderings on the locations, we con- 
struct an appropriate distance matrix and use the Laplacian 
of the corresponding weighted graph to find a global order- 
ing of the locations. 

3. PROBLEM STATEMENT 

In this section, we define some notations and formally 
present the map matching problem. We assume that the 
user traverses a path T on the road network with some 
bounded velocity. At time instants ti, • • ■ ,t n , her location is 
recorded by the GPS device or obtained by other localization 
methods. Each measured data consists of a time-stamped 
latitude/longitude pair, which is subject to some noise. De- 
note the actual location of the user at time tj by jj and let 
jj be the measured location at time tj (jj, jj £ K 2 ). 

The location noise is distributed as a zero-mean Gaussian 
vector with variance a 2 , i.e., 



N(0, 







We call the time-stamped sequence F = (71,..., 7„) a 
track. Figure [lja) shows a portion of a path F, and (b) 
shows the corresponding track. In this paper, we consider 
the following two problems. 

• Single-track map matching. We are given a single track 
F. Since locations are time-stamped, there exists a 
global ordering on the locations in the time domain. 
The aim is to reconstruct the path F from F. 

• Multi-track map matching. Several user tracks Ti, ■ • ■ , 
r m are available, all generated from traveling on a sin- 
gle path F. This models the scenario where the tracks 
are collected over different days or from different users 
traveling across F. The goal here is to use all the tracks 
to recover the path. Note that in this case, there are 
only partial orders on the locations in the time domain. 

The map matching problem becomes more challenging 
when the location measurement error is high and/or when 
the sampling rate is too high or too low. Furthermore, for 
the same number of sample points and the same measure- 
ment error, the multi-track map matching is inherently more 
challenging as it lacks a global ordering of the points. 



In solving the map matching problem, we implicitly as- 
sume that the user tends to travel on the shortest (quickest) 
path. This is an important assumption that facilitates find- 
ing good matches. 

In order to evaluate the quality of a map matching algo- 
rithm, we need a way to measure the similarity between two 
paths on the map. We can view each path F as the set of 
road segments it contains. We define the similarity between 
two paths as 



sim(r Xl r 2 ) = 



£(Ti nr 2 ) 
^(riur 2 ) 



(i) 



where for a set S of road segments, £(S) denotes the total 
length of the road segments in S. Notice that this measure 
captures both false negative and false positive segments on 
the matches. In particular, sim(ri, r 2 ) < 1 and sim(Fi, T 2 ) = 
1 iff both paths consist of the same set of road segments. 
Our metric is slightly different from some previously used 
metrics. For example, in [13] , the similarity measure ignores 
false negatives which results in a more lenient metric than 
ours. 

In the following, we denote the Euclidean distance by || ■ || 
and the shortest path distance by || ■ For a path P and 
two points x,y G P, we denote the length of P between the 
points x,y by ||a;j/||p . 

4. SINGLE-TRACK MAP MATCHING 
4.1 Algorithm 

Our method is based on minimizing a regularized cost 
function that balances two types of errors in measuring the 
quality of a path: the data error, which measures the fidelity 
of the path to the data points, and the model error, which 
measures the "niceness" of the obtained path. 

Formally, our method maps each jj to a point Xj on the 
road network. The produced path then consists of con- 
secutive shortest paths that connect Xj and Xj+i for j — 
1, ... ,n — 1. For the path defined by X — (xi, . . . , x n ), the 
quality of the match is measured by the following regularized 
cost function. 



c\(fU) = X)i 



35j'Tj I 



CjZj+llld , 



(2) 



where ||ary||d represents the driving distance between two 
points x and y on the road network, i.e. the length of the 
shortest path between x and y. 

The above cost function contains two terms: the former 
measures the distance of the observed points to the path and 
corresponds to the data error. The latter measures the local 
optimality of the path and corresponds to the model error. 
The regularization parameter A balances between these two 
terms and plays a crucial role in the estimator error. 

Given the cost function C\{Y,X), the map matching al- 
gorithm finds the sequence that minimizes C\(T, X) to serve 
as the matched path. We denote the outcome as 

A X (T) = argmin x C A (f,X). 

Since finding the global minimum is difficult, our imple- 
mentation, to be described later, actually finds an approxi- 
mate solution. 

The cost function C\(-, •) is very similar to what has been 
used in previous methods based on Hidden Markov Models 



(HMM) [12| |15| |13] (with minor variations in modeling the 
error). However, in all previous work, A is set to a constant. 
One of the focus of our work is to study the impact of A on 
the match quality and to present guidelines for choosing the 
optimal A for a given input. 

4.2 Choosing the regularization parameter 

As discussed before, the regularization parameter A con- 
trols the weight of the terms in the cost function. We will 
show, through both theoretical and empirical studies, that 
the value of A has significant impact on the matching qual- 
ity. For the theoretical analysis, we consider a simplified 
model and present an asymptotically optimal choice of A 
for the model. Later, we evaluate the choice of A through 
experiments with real data and show the same trend as the- 
oretically predicted. 

For an intuitive understanding of the effect of A, consider 
the extreme cases A = and A = +00. Examples are shown 
in Fig. [IJ 

• A = 0. In this case, all weight is put on fidelity 
to the measured data. Therefore, the obtained path 
is the one passing through the projection^] of mea- 
sured data onto the nearest road. While the location 
data is important as the sole indicator of the path, 
naively matching each noisy point to the nearest road 
will result in unsuitable paths involving short loops or 
U-turns, and overall a peculiar driving behavior. 

• A = +00. In this case, the emphasis is on finding a 
quick path; As a result, lots of geometrical details in 
the original path will be missing in the recovered one. 

For a better understanding of the effect of A, we study 
a simplified model and characterize the optimal choice of A 
for it. In the model, we consider the situation where the 
sampling rate is high so the shortest path distance can be 
approximated by the Euclidean distance between the two 
endpoints. For a given regularization parameter A and T = 
(71, ... , 7„), let A' x (r) denote the optimal solution under A, 
i.e. 



A' x {t) = argmin x C;(f,X) 

n 

= axgnun x ^)||a; 3 -7i| 
3=1 



3=1 



Notice that in this model, we use the Euclidean distance 
rather than the shortest path distance in the cost function. 
To evaluate the quality of matc hing, we adopt the standard 
minimax risk framework 6, lOj. Let T = (71, . . . , y„) be the 
ground truth. We require V to satisfy the condition 



||7i7.i+i|| < cL/n, 



(3) 



where L is the total length of the path and c is a constant 
dependent on factors that upper-bound the distance between 
consecutive samples, such as speed-limit. Recall that for an 
observed sequence of locations V = (71, • •• , 7n), we have 
jj — 7j + ogj , where gj are independent standard Gaussian 
noises. Therefore, F follows the distribution N(r, crJ). We 
use the mean squared error to measure the quality of the 
output, i.e. for a match X = (xi, . . . ,x n ), let e(Y,X) = 

1 The projection of a point on a path is the nearest point on 
the path to the point. 



(a) Original Path 



(b) Measured Locations 




(c) Reconstructed Path, A = 10 2 



Figure 1: The effect of regular izat ion parameter A. A small 
A leads to missing geometrical details of the path. 

H^jTj l| 2 denote the error of X. Then the expected 
error of any map matching algorithm M under the ground 
truth P is 

£(M,r)=E f ^ N(r>CTj) e(r,M(f)). (4) 

The minimax risk of a map matching method is R{M) — 
maxr S(M, F), where P is taken among all the samples sat- 
isfying J3|. We can show that 

Theorem 4.1. With the above notation, we have 

R(A' X ) < Cl \ (J^j +c 2 ^, (5) 

for some constant Ci,c% > 0. R is minimized when A* = 
9((ncr/L) 4/3 ) with R(A' X ,) = 0((a 2 L/n) 2/3 ). Furthermore, 
for any estimator M, R(M) = n((a 2 L/n) 2/3 ) . 

The proof of the above theorem requires some involved 
analysis which we omit from this abstract. Instead, we dis- 
cuss some implications of the above theorem. 

1) Large values of A put emphasis on minimizing the path 
length and may lead to large data errors (first term in ([5|). 
Meanwhile, small values of A allow the model to become 
finely tuned to noise, potentially leading to large model er- 




(d) Reconstructed Path, A = 10' 



value of A results in strange loops, while a large value of 



rors (second term in l|5|). The optimum choice of A is a 
balance of the two error terms. 

2) The optimal A* = 0((n<r/L) 4 / 3 ) increases with either 
n or a and decreases with L. Intuitively, this means that 
we need to put more weight on the model error when the 
location measurement is noisy or when the sampling rate is 
high. We shall verify this observation later via experiments. 

3) The optimal minimax risk, 0{(a 2 L/n) 2/:i ) -> when 
n — > oo. In other words, noise and under sampling have a 
similar effect on the error. For any value of a 2 , the error 
can be made arbitrarily small by increasing the sampling 
rate. But this is not true if we choose a constant A since the 
second term then remains constant. This partly explains the 
paradoxical phenomenon, as observed in [l5], in which the 
quality of a map matching algorithm deteriorates when the 
sampling rate is very high. 

4) The algorithm based on the regularized cost function is 
asymptotically optimal (in the minimax risk order of magni- 
tude) among all the matching algorithms! We find this quite 
surprising given the simple formulation of the algorithm and 
the vast options of estimators. 

Theorem |4.1| and the above discussion applies to the sim- 
plified model in which we replace the shortest path distance 
by the Euclidean distance. However, as we shall show in our 



experiments, the above statements qualitatively hold for the 
map-matching problem. Thus they serve as good guidelines 
for choosing A. 

4.3 Estimating a 

As shown above, the optimal choice of A depends on the 
measurement noise a and the average distance between con- 
secutive samples L/n. While the latter can be easily found 
through the data, we need to estimate a from the input. We 
use a cross validation technique to estimate the parameter 
a. We divide the samples into a disjoint training set and a 
validation set. We then use the map-matching algorithm on 
the training set to obtain a path P. Next, we compute the 
distance from each point in the validation set to P, and use 
the average distance to determine a. 

To achieve better accuracy, we need to make sure that 
the training set has sufficient samples to produce a high 
quality match. This is controlled by taking a sparse subset 
of regularly spaced samples as the validation set. A more 
formal description can be found in Procedure [l] (for a path 
7' ! ' and a location jk, the notation V^a) {jk) in Procedurejl] 
denotes the projection of 7fc onto the path 7^*'). 

Procedure 1 Cross Validation for estimating a 

Input: Measured locations {7*;} for k — 1, . . . ,n. 
Output: The estimation a. 
1: Set m = |_\AvJ • 

for i £ {1, . . . , m} do 

Let Si = {ji,jm+i,j2m+i, ■■■}; 
Match a path 7^ to the points in ; 
Let 



1 . , 

,; =151 L 117*^(0 (7fe)ll 



7fcSSi 



Set <Ji — y/di/2; 
end for 

Return a = 1 /m Y^iLi &i 



One remaining question is the choice of A in the map 
matching algorithm in Procedure[T](step 4). As shown in the 
experiments, for a fairly large range of A, the reconstruction 
accuracy is still adequate to estimate a. 

4.4 Implementation 

The global optimization with respect to the cost func- 
tion |5| is possible but time consuming. We employ a prun- 
ing procedure to reduce the solution space to a smaller set of 
candidates and then apply dynamic programming to com- 
pute the minimum solution in the pruned set. The resulting 

HH Ill- 



algorithm is similar to HMM-based methods 12 



We first construct a multipartite graph G with the j 
part corresponding to the measurement 7,. For each part, 
we consider a small set of road points, which we call match 
candidates, one of which will be matched to yj . The match 
candidates should be chosen such that they represent well 
all the possible points that might have generated 7j . On the 
other hand, we need to keep its size small for fast computa- 
tion. 

In our implementation, for each 7, , we consider the road 
segments that are within 200 meters from it. On each of 
these road segments, we consider N + 1 match candidates 



including the nearest point on the road segment to 7, along 
with N other points evenly spaced on that road segment, 
(see Fig. [5] for an illustration). The role of these N extra 
candidates (per segment) is to increase the algorithm flexi- 
bility in choosing matches, especially when the location error 
is large or when the sampling rate is too high. In all of our 
experiments, we choose N = 3. To every vertex in part j, 
corresponding to point x on the map, we assign the value 
Il7^f. 

We then connect every vertex in part j to every vertex in 
part j + 1. To each edge between points x and y, we assign 
the edge weight A||a;j/|]^. We then compute the minimum 
weighted path in graph G and output the points on the 
path as the match. The weight of the path is calculated 
by summing up all the edge and the vertex weights on the 
path. We use standard dynamic programming algorithm 
to compute the optimal path in this graph. To efficiently 
compute the edge weights, we use the contraction hierarchy 
shortest path software developed by [8]. 

5. MULTI-TRACK MAP MATCHING 

In the single track map matching problem, the global or- 
dering on the locations is known a priori since timestamps 
are available for each location. In the Multi-track map 
matching problem, ordering of locations is also available for 
each individual track. However, it is not known how samples 
from different tracks are ordered with respect to each other 
when we merge them into a single track. In this section, we 
propose techniques that take samples from multiple tracks 
refering to the same route and combine them into a single 
track with total ordering. Once total ordering is achieved, 
single-track map matching may be applied to the combined 
track. We first propose two methods to obtain a global or- 
dering on locations. We then introduce a boosting process 
that further enhances the performance of both methods. 

Before proceeding to the methods, we establish some no- 
tations. For a set of measured locations S and an order tt on 
S, we denote by SMM(S', n) the outcome of the single track 
map matching algorithm described in Section [4] When S 
consists of a single track, we simply write it as SMM(S) as 
the order is already given in the track. 

5.1 Iterative projection method 

For any given path P, we define the order of a set of 
points S with respect to P as follows. We first compute, for 
each point s 6 S, the nearest point s, called the projection 
of s, on P. We then order S according to the order of s 
on P. In the iterative projection method, we choose an 
initial path and then order all the points with respect to 
the path. Once we obtain the order, we run the single-track 
map matching algorithm on the points with the computed 
order. The resulting path becomes the new candidate path, 
and we repeat the above process until either the process 
converges or it has run for too many rounds. The algorithm 
is summarized in Procedure |2] 

The quality of the algorithm depends on the choice of the 
initial path. At the extreme, if the initial path is just a point, 
then the projection method would project all the points to 
the same point and thus fails to produce any meaningful re- 
sult. A good initial path should "cover" all the points, i.e. 
pass through most points. However, we do not require the 
initial path to have very precise geometry as it will be cor- 
rected nicely during the iteration. With these constraints, 




(a) Matched Candidates for measurement (b) Graph of matched candidates 

Figure 2: We use dynamic programming to find the minimum weighted path in the graph of matched candidates. In 
Fig (a), N = 3, hence we have 4 matched candidates on each road segment ri,r2,?'3. 



we choose the starting track as the one that minimizes the 
sum of distances from samples in all the other tracks to it, 
where the distance from a sample to a track is the short- 
est distance between the sample and the track. This often 
leads to a good candidate, but occasionally we end up with 
bad initial path. The boosting process that we later describe 
helps alleviate this problem. 

In order to quickly compute the nearest point, we imple- 
mented a quad-tree [7] data structure so that for any given 
path represented by a polygonal line, we can compute effi- 
ciently, for any query point, its nearest point on the path. 



Procedure 2 Iterative projection method 

Input: a set of GPS locations S from tracks Ti, ■ • • , IV 
Output: global ordering tt on S. 

1: Choose the initial track So from Si, • ■ • , Sk 

2: Set P = SMM(So); 

3: repeat 

4: Compute the order n of S with respect to P; 
5: Set P = SMM(S,tt). 

6: until (P remains the same) or (too many rounds) 
7: return tt 



5.2 Graph Laplacian Method 

If we ignore the order of the GPS locations, the map- 
matching problem resembles the classical curve reconstruc- 
tion problem in which one is asked to reconstruct a curve 
from discrete samples on the curve. In our second approach, 
we apply graph Laplacian method, an effective method in 
machine learning [5J |17| |16| [2], to compute the order of 
the GPS locations and then apply the single track map 
matching on the resulted sequence of GPS locations. We 
will first describe the standard graph Laplacian method and 
then make modification to adapt to the particular problem 
of map matching. 

Graph Laplacian method. Graph Laplacian works in 
any metric space. Suppose that we are given n points and 
the n x n distance matrix D. The goal is to find an order of 
the n points that is consistent with D. Intuitively, an order 



is consistent if nearby points in the order are also close-by 
according to D. The graph Laplacian method computes such 
an order by first solving the following minimization problem 

minimize ( V i~ v j) 2 /Dij 

v — J 

1 <i<j <n 

subject to ||u|| — 1, (6) 

3 

where ||v|| = y/YLi v i- Given the solution v* of (pL we 
sort the values of uJ for j = 1, . . . , n to obtain an ordering 
tv of the points. The intuition of the above approach is 
quite clear: if i,j are close, i.e. Dij is small, we would then 
like \vi — Vj\ to be small in order to minimize the objective 
function, which in turn implies that i, j would be close in the 
resulted ordering. The constraints are to avoid the trivial 
solution of setting all Vj's to some constant or scaling Vj's 
to very small number. 

The optimization problem Q can be efficiently solved by 
computing the eigenvectors of Laplacian matrix L defined 
as 

T - J ~ l / D » 
This is because v T Lv = 

E,K " UjY/Da. Therefore, L 
is positive semi-definite and the minimum eigenvalue of L 
is corresponding to the eigenvector (1, . . . , 1) T . Since the 
eigenvectors of any symmetric matrix are orthogonal to each 
other, the eigenvector corresponding to the second smallest 
eigenvalue is exactly the solution to 

The quality of the Laplacian method depends crucially 
on the accuracy of D as an approximation of the distances 
along the curve. If it is exact, then the Laplacian method 
produces the exact answer. One practical heuristic is to 
replace Dij by exp(cD i:) ) so to put more weights on the 
short distances as they are more trustworthy approximation 
to the distances along the curve. In the following, we will 
describe how we form the distance matrix in order to more 
faithfully approximate the shortest path distance between 
the points. 



Constructing D for map matching. The straight for- 
ward solution would be to use the Euclidean distance be- 
tween points to form the distance matrix. However, this 
approximation can be quite poor, especially for points far 
away from each other or for points which are close together 
in the space but far away by travel (think of two points on 
the opposite banks of a river). To remedy these problems, 
we utilize the order information coming from each individ- 
ual track. Basically, when we compute the distance between 
two sample points on different tracks, we use the distance 
along the matched path to each individual track. Figure [3] 
shows an example on how the distance is computed. 




outlier points with large error. The outlier points may cause 
either method to get trapped into some wrong path. To fix 
this problem, we introduce a boosting process to improve the 
robustness of both methods against outliers. Fig. [4] shows 
the boosting scheme. 
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Figure 4: Scheme for the boosting process. 



Figure 3: Computing distance matrix. yik,ya (yjk,yji) 
are the closest points to (jj) on the path Pk and Pi, 
respectively, d^' = WliVikW + WljVjkW + \\yikVjk\\p k , dfj = 
WliVilW + WljVjlW + Ill/iltfilllPii and D tj = (d^ +df J ) )/2. 

Procedure [3] describes the details of how we compute the 
distance matrix. Recall that ||a;y||p denotes the distance 
along the path P for two points x,y £ P. 



Procedure 3 Form the Laplacian matrix 

Input: n points 71, . . . , <y n from m tracks Pi, ... , T m 
Output: D: an n x n distance matrix 

1: for i 6 {1, . . . , m} do 

2: Pi = SMM(fi) 

3: end for 

4: for i £ {1, . . . ,n} do 

5: for j g {i + 1, . . . ,n} do 

6: Suppose 74 S F k and jj G r ; ; 

7: Let yik , yu be the nearest point to ji on path Pk 

and Pi, respectively, (see Fig. [3| 
8: Let yjk,Vjl be the nearest point to 7,- on path Pk 

and Pi, respectively. 

9: 4f = \\jiyik\\ + WyjVjkW + \\yikyjk\\p k ; 
10: d$ = WiiyuW + WijyA + WyuVjiWp,; 

11: D y = (d«+d«)/2. 
12: end for 
13: end for 



We then form the Laplacian by using the exponential 
weight, 

_ / -exp(-cAi) i £ j 
3 ~ \ Ek^M-cDik) i=j 

Once we have L, we just invoke the standard Laplacian 
method, i.e. compute the eigenvector v corresponding to the 
second smallest eigenvalue of L; sort the values in v; return 
the sorted order n as the global order of the points. 

5.3 Boosting Process 

Both the iterative projection scheme and the Laplacian 
method are susceptible to noises, especially when there are 



The boosting process is fairly generic and does not depend 
on any particular ordering method. From the set S, we 
generate m subsets, namely Si, ... , S m ■ Each Si is obtained 
by sampling the data points in S with some probability p. 
Hence E|Sj| = p\S\. Then, an ordering algorithm (either the 
iterative or the Laplacian method) is used to return a global 
ordering on the set Si , denoted by Hi . The aggregation block 
in the algorithm takes the orderings 7Ti , . . . , 7r m as input and 
returns an ordering ir which is the most consistent one to 
the orderings 7Ti, • • • , n m in the following sense. 

The consistency score of two orders is defined as the num- 
ber of pairs they agree on subtracted by the number of those 
they do not. Here we do not require the two orders to be 
defined on the exactly same set of elements. When they do 
not, we only consider the elements common to both. For- 
mally, 

consul, tt 2 ) = ■■ (7Ti(f) - 7Ti(i))(7r 2 (i) -7r 2 (j)) > 0}| 

- |{(t, j) : (Mi) ~ ti(7))M0 ~ Ta(j)) < °>l- 
The aggregator then attempts to find 

/?< 

7r* = argmax > cons(7r, nk) , 
k=x 

where n is the set of all possible orderings on all subsets of 
S. The reason that we do not insist 7r to be a full order is 
that we would allow the flexibility of excluding outliers from 
the input data. 

Unfortunately, computing the most consistent order of a 
set of permutations is NP-hard. Our aggregator block uses 
heuristics to find the solution. We first create a directed 
acyclic graph (DAG) on the points which are consistent with 
the set of permutations. We then compute the longest path 
in the DAG and output it as the aggregated order. The 
details are shown in Procedure [4] 

In the description swap(7r,i,j) returns a permutation by 
swapping the elements in the i-th and j-th positions in tv. In 
the lines 4-12, we repeatedly pick a random permutation as 
a starting point and find a locally most consistent ordering 
through local searches. We repeat the process K times with 
a given K. In our experiments, we find it is sufficient to set 
K to 100. We then keep the best order among K trials. Each 
round finishes in 0(mn 2 ) iterations as each swap increases 
the score by at least 1, and the maximum score is bounded by 



Procedure 4 Order aggregation. 
Input: A set of linear ordering m, . . . , n m 
Output: A consistent order n* 
1: Let the scoring matrix M € R nxn be denned by 

Mij = \{lT k : 7Tfc(i) < 7Tfc(j)}l _ l^k ■ 7Tfc(«) > Tfc(j)}l • 

2: For any permutation tt, define 

score(Tr) = ^ Mijl {lT(t)<1T{])} ; 

3: 7Tm = 0; 

4: for fc € {l,...,if} do 

5: Pick a random permutation 7r; 

6: while 3 i,j score(swap(7r, i, j)) > score(7r) do 

7: 7T = swap(7r, i, j); 

8: end while 

9: if score(-7r) > score(7rjv/) then 
10: ■km = tt; 
11: end if 
12: end for 

13: Form G = (V, E) where € E iff My > and 

KM(i) < T/v/(j); assign weight Mj 3 - to the edge (i, j); 
14: Compute the maximum weighted directed path P in G; 

15: Let tt* be the order determined by the path P. 



0(mn 2 ). In practice, the convergence is much faster, usually 
within 0(n) iterations. Line 14 can be done in linear time 
since G is a directed acyclic graph. 

6. EXPERIMENTS 

We evaluate the performance of the proposed algorithms 
on data generated from real tracks. We will first describe the 
data generation process and then present the experimental 
results. 

6.1 Data 

We utilize dataset of tracks collected from real users in 
Seattle, WA [II]. These data are collected using commer- 
cially available consumer grade GPS device and are sampled 
with high density. To test the map matching algorithms, we 
generate synthetic data from these tracks with any given 
measurement error and sampling rate. 

More specifically, from the dataset, we remove the paths 
that are too short or too long. Table [l] summarizes some 
statistics of the paths we use. 



Table 1: Statistics of data 





Length (km) 


Duration (min) 


Segments 


mean 


14.5 


19 


77 


minimum 


3.4 


6 


20 


maximum 


38.4 


37 


169 



For a given path, we first sample in the time domain and 
determine at which time to take a sample. We then use a 
randomly generated speed on each road segment to sample 
the location on the path. Then we add the location error to 
produce a sample location. Procedure [5] describes how we 



sample from a given path with a given location error a and 
sampling interval r. 



Procedure 5 Generating sample from a path 
Input: a path P, measurement error a, sampling distribu- 
tion and rate r 
Output: a sequence 71,72, . . . 
1: Generate time steps t\,ti,... according to the required 

sampling distribution and rate r; 
2: Associate with each edge e in P a speed v e sampled 
uniformly at random from [0.814, 1.214] where 14 is the 
speed limit on the edge e; 
3: Using the speed, compute the location 7^ associated with 

each tj on path P; 
4: Let 7, = jj + ag where g is a standard Gaussian; 
5: Produce 71 , 72 , . . . with associated time stamps t\ , ti , 



For the sampling distribution, we experiment with both 
the uniform and the exponential distributions. Under the 
uniform distribution, we generate time-steps regularly at 
r, 2r, • • • ; under the exponential distribution, we generate a 
series of times where the inter-arrival time between any two 
adjacent samples follows the exponential distribution with 
mean r. The exponential distribution better models the 
location data obtained when the user uses location-based 
services. 

6.2 Single-track map matching 

In the first experiment, we run the single track map match- 
ing algorithm on the first 20 tracks in the data set and for 
various combinations of a, r and A. For each track and each 
combination of parameters, we generate 10 instances and 
take the average of the similarity of the middle 8 results. 
We then compute the average similarity over 20 tracks for 
each combination of a, r and A. The results for a = 20 and 
t — 60 are shown in Fig. [5] 

From these experiments, we can make the following ob- 
servations. 

1. The similarity result is quite stable, but the choice of 
A has a visible impact on the quality. For example, for 
a = 20 and r = 10 (the topmost curve in Figure |5| a)), 
there is a gap of 8% between the result when A = 1 
and A = 10~ 6 . 

2. The experiments show similar trends as predicted by 
Theorem |4.1| From Figure [5|a), we can see that with 
g = 20 fixed, when the sampling rate increases, i.e. n 
decreases, the optimal A decreases, shifting from 1 to 
0.01. From Figure [5|b), when fix r = 60, optimal A 
increases when a increases too. 

6.3 Multi-track map matching 

We now evaluate the performance of the different tech- 
niques proposed for multi-track map matching, i.e., the iter- 
ative and the Laplacian methods and their boosted variants. 
In the following experiments, tracks are generated using an 
exponential distribution. We collect results for 25 real routes 
taken by users, and for each route, generate a number s of 
track instances. (Different instances can be thought of as 
measurements collected over different days). 

First, we compare the four methods for computing a multi- 
track map matched route from s = 20 track instances, syn- 
thetically generated with a sampling period T — 300s and 
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Figure 5: Similarity of single-track map matching algorithm, (a) a = 20 meters, varying r and A; (b) r = 60 seconds, 
varying a and A. 



with variable measurement errors. For the boosting process, 
we choose m = 10 and p = 0.5, i.e. we run the algorithm for 
10 subsamples and in each run include each point with prob- 
ability 0.5. Therefore, a point is included in five sub-samples 
on average. We compute the similarity of the outcome rela- 
tive to the ground truth. The average results when running 
the algorithm over the 25 different routes are presented in 
Fig. [6] We removed the top and the bottom 10% similarity 
results prior to computing the average. 

From the figure, we observe that in the non-boosted ver- 
sions, the iterative and the Laplacian methods produce sim- 
ilar results, independent of the measurement error. Mean- 
while, the boosted variants of both approaches improve the 
performance of their equivalent non-boosted versions. As 
shown in Fig. [6] the boosted iterative method returns bet- 
ter matches when the measurement error is smaller, but the 
quality of the results deteriorates faster as the measure- 
ment error increases. The boosted variant of the Lapla- 
cian method outperforms all other methods once the error 
is larger than 100m. This happens because in the iterative 
method, each point is mapped to the nearest neighbor on 
another path. This makes it more sensitive to the location 
measurement errors. 

We also studied the behavior of the methods as the num- 
ber of available sample tracks per route increases. In Fig. [7] 
we vary the number of sampled tracks from 1 to 100 and 
present the average similarity between the obtained routes 
and the ground truth. In this experiment we fixed the mea- 
surement error to a = 10m and the sampling period to T — 
300s. The results show the boosted versions significantly 
outperforming the non-boosted variants once the number 
of samples is large enough (5 for the iterative method and 
20 for the Laplacian method). Once the number of sam- 
ples is larger than 45, the boosted Laplacian method starts 
outperforming the boosted iterative method. This happens 
because the Laplacian method is more robust than the it- 
erative method, and therefore the effect of boosting is only 
visible when the number of tracks is large. 

Finally, we studied the effect of the regularization param- 
eter A used in the single-track subroutine when computing 
multi-track map matching with these four techniques. In 
Fig. [8] we present the average similarity values when lambda 
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Figure 6: Average similarity when computing multi- 
track map matching with s = 20 tracks. Sampling period 
of tracks is T = 300s, and the measurement error a is 
varied from 10 to 200 meters. 
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Figure 7: Average similarity when computing multi- 
track map matching with increasing numbers of tracks 
(s = 1 to 100). Sampling period of tracks is T = 300s, and 
the measurement error is cr = 10m. 
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Figure 8: Average similarity when computing multi-track map matching with increasing numbers of tracks (s = 1 to 
50) tracks. Sampling period of tracks is T = 300s, and the measurement error is cr = 10m. 



is 10 , 1 and 10 , for all four techniques, as we increase 
the number of track instances s from 1 to 50. As observed, 
the non-boosted variants are more affected by the choice 
of lambda than their boosted counterparts. This happens 
because in the boosted version, the sub-sampling process ef- 
fectively reduces the density of the samples and therefore 
makes the results less sensitive to the change of A. 

7. CONCLUSION 

In this paper, we defined and studied the multi-track map 
matching problem, in which multiple very sparse tracks of 
location samples on a single route are combined together 
and used to recover the underlying route. We tackled this 
problem by breaking it into a two step process: We first 
computed a global ordering on the entire set of samples; once 
a global ordering on the samples is computed, the problem 
reduces to the single-track map matching. In the second 
step, we solved the corresponding single-track map matching 
problem. 

We proposed and evaluated two algorithms for obtaining a 
global ordering on the samples, namely the iterative projec- 
tion approach and the graph Laplacian approach, as well as a 
boosting process that helps remove outliers in the samples. 
Our results indicate that the proposed approaches lead to 
reasonable estimates of the route, significantly better than 



what would be achieved in case tracks were map matched 
individually. 

We also revisited the single-track map matching problem, 
which is an essential building block for the multi-track vari- 
ant. We formulated the single-track map matching problem 
as a regularized optimization problem where the regulariza- 
tion parameter can be fine-tuned based on the properties 
of the map matched data. We evaluated the effect of the 
regularization parameter under different configurations of 
measurement error and sampling rates, and showed how the 
parameter may be successfully varied to achieve best results 
under different settings. 
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